The AMARA Corpus: Building Resources for Translating the Web’s Educational Content
Abstract
In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to aligning the segments and extrinsically evaluate the resulting parallel corpus on the standard TED-talks tst-2010. We observe that the data can be used successfully for this task, and that it yields an absolute improvement of 1.6 BLEU when combined with TED data. Finally, we analyze some of the specific challenges of translating educational content.
Related resources
The AMARA Corpus: Building Parallel Language Resources for the Educational Domain
This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multilingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validat...
Translation Evaluation in Educational Settings for Training Purposes
The following article describes different methods and techniques used in educational settings for translation evaluation. Translation evaluation is the placing of value on a translation i.e. awarding a mark, even if only a binary pass/fail one. In the present study, different features of the texts chosen for evaluation were firstly considered and then scoring the t...
Norms of Translating Taboo Words and Concepts from English into Persian after the Islamic Revolution in Iran
The research attempted to discover the norms of translating taboo words and concepts after the Islamic Revolution in Iran using Toury’s (1995) framework for the classification of norms. The corpus of the study comprised Coelho’s novels published between 1990 and 2005 and their Persian translations, which were prepared and analyzed manually to discover the norms. During both the selection of novels for trans...
Leveraging Content from Open Corpus Sources for Technology Enhanced Learning
As educators attempt to incorporate the use of educational technologies in course curricula, the lack of appropriate and accessible digital content resources acts as a barrier to adoption. Quality educational digital resources can prove expensive to develop and have traditionally been restricted to use in the environment in which they were authored. As a result, educators who wish to adopt thes...
Slicepedia: Automating the Production of Educational Resources from Open Corpus Content
The World Wide Web (WWW) provides access to a vast array of digital content, a great deal of which could be ideal for incorporation into eLearning environments. However, reusing such content directly in its native form has proven to be inadequate, and manually customizing it for eLearning purposes is labor-intensive. This paper introduces Slicepedia, a service which enables the discovery, reuse...